library(readr)
library(textstem)
library(stringr)
library(tidyr)
library(dplyr)
library(stopwords)
library(tidytext)
library(tidyverse)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(syuzhet)
library(ggplot2)
library(stringi)
library(tibble)
library(textdata)
library(Hmisc)
library(sentimentr)
library(zoo)
library(flextable)
library(magrittr)
library(lexicon)
library(broom)
library(qdap)
library(metricsgraphics)
library(ggthemes)
library(radarchart)
library(treemap)
library(ggraph)
library(lattice)
library(igraph)
library(udpipe)
library(widyr)
library(topicmodels)
library(rJava)
library(mallet)
library(LDAvis)
library(servr)
model_loaded <- udpipe::udpipe_download_model(language = "english-ewt")
Data Description
Our data has been collected specifically by us and is a collection of jokes and anecdotes from the site Reddit [1].
‘Reddit’ is an American social news aggregation, content rating, and discussion website. Registered users submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members. Reddit is organized into subreddits. A subreddit is a specific online community, and the posts associated with it, on the social media website Reddit. Subreddits are dedicated to a particular topic that people write about, and they are denoted by /r/, followed by the subreddit’s name, e.g., /r/gaming.
In our work we consider the subreddit “jokes”, scraped from its TOP page. The texts of the jokes range from 5 to 350 words each. One parsing problem was that the title was downloaded into a different column than the main body of the text, although in most cases the title was the beginning of the joke. For instance, title: “What do you get when you mix Napoleon Dynamite and Napoleon Bonaparte?” and main text: “Napoleon Blownapart”. So we had to join these two columns into one, which we named text. It will be the first column of importance to us. We also have the column “comments”, which will likewise participate actively in our work. In advance, we deleted the Reddit URLs and other material that we were not using in our analysis.
data <- read_csv("jokes_file.csv")
head(data)
## # A tibble: 6 × 5
## date_utc title text text …¹ comme…²
## <date> <chr> <chr> <chr> <dbl>
## 1 NA <NA> <NA> <NA> NA
## 2 2022-11-19 "What four elements are you not allowed to b… "Wha… "Nitro… 17
## 3 2022-11-19 "Why do farts smell?" "Why… "So de… 7
## 4 2022-11-06 "Two great opportunities are walking past a … "Two… "\nThe… 1
## 5 2022-12-05 "What was the snowman doing in the vegetable… "Wha… "Picki… 2
## 6 2022-11-22 "Check out \"conjunctivitis.com\"." "Che… "Its a… 1
## # … with abbreviated variable names ¹`text (without title)`, ²comments
The data consists of 1000 posts and 5 columns (among them the date of publication, the title, the text of the joke/anecdote, and the number of comments under that joke).
Of these we will certainly use the publication date, the number of comments, and the text itself. With these parameters we will analyze the jokes, their content, and their popularity.
Research Question
In our article, we want to analyze how jokes with lots of comments differ from jokes with few comments. We assume word frequency differs between these two groups, so we divide the data into two parts, where “discussive” marks texts with lots of comments and “not discussive” marks texts with few comments on the post. We assume that the more discussable jokes will contain more harsh and discussion-inducing words (e.g., jokes about global situations, popular people, and issues), while the less discussable posts will contain more neutral words that describe people’s everyday lives and build the joke around that. RQ: Do harsh words or social topics in jokes affect the number of comments (discussiveness) under the post?
In this vein, we think that we can observe some differences between the two groups, so here we assume:
H1: We assume that the more discussed jokes will have more sharply political and other controversial words in their corpus than the less discussed posts. This may be because people have radically different views on politics and other world issues, so they will want to express their opinions in the comments or debate with other users. [2]
H2: We assume that neutral or mundane words will be more common in non-discussive posts than in discussive posts. This may be because people are very similar in these areas and lifestyles, so they will not debate in the comments or leave many comments because of it.
H3: Since more negative emotions will encourage people to write comments, there will be more words associated with negative emotions in the discussive group than in the non-discussive group.
H4: When dividing discussive and not discussive texts into topics, there will be more debatable topics and words related to politics, religion, etc. in the discussive group than in the not discussive one.
H5: In the network analysis, it will be noticeable how words in the discussive group cluster around rather controversial topics, such as politics and religion, while in the not-discussive group there will be more daily-life jokes about family, friends, celebrations, etc.
We develop a doc_id identifier for each Reddit text. It will be useful for understanding from which text in the corpus we got a specific word. We also used the na.omit function here to delete all the NA values, so that the texts do not cause problems with differing lengths, and we use doc_id to identify the number of each text.
data <- na.omit(data)
head(data)
## # A tibble: 6 × 5
## date_utc title text text …¹ comme…²
## <date> <chr> <chr> <chr> <dbl>
## 1 2022-11-19 "What four elements are you not allowed to b… "Wha… "Nitro… 17
## 2 2022-11-19 "Why do farts smell?" "Why… "So de… 7
## 3 2022-11-06 "Two great opportunities are walking past a … "Two… "\nThe… 1
## 4 2022-12-05 "What was the snowman doing in the vegetable… "Wha… "Picki… 2
## 5 2022-11-22 "Check out \"conjunctivitis.com\"." "Che… "Its a… 1
## 6 2022-11-21 "A plane is sitting at the terminal and is s… "A p… "Depar… 2
## # … with abbreviated variable names ¹`text (without title)`, ²comments
# all data is cleared of NAs, hooray
data <- data %>% mutate(doc_id = row_number())
# each doc_id identifies the number of each text
head(data)
## # A tibble: 6 × 6
## date_utc title text text …¹ comme…² doc_id
## <date> <chr> <chr> <chr> <dbl> <int>
## 1 2022-11-19 "What four elements are you not allow… "Wha… "Nitro… 17 1
## 2 2022-11-19 "Why do farts smell?" "Why… "So de… 7 2
## 3 2022-11-06 "Two great opportunities are walking … "Two… "\nThe… 1 3
## 4 2022-12-05 "What was the snowman doing in the ve… "Wha… "Picki… 2 4
## 5 2022-11-22 "Check out \"conjunctivitis.com\"." "Che… "Its a… 1 5
## 6 2022-11-21 "A plane is sitting at the terminal a… "A p… "Depar… 2 6
## # … with abbreviated variable names ¹`text (without title)`, ²comments
Data preparation
At this point we begin to prepare our data for analysis. First we remove NA values from the data, then we clean up the text data, tokenize it, and create a frequency list to take a closer look at the most popular words in each segment.
1. Delete other strange signs that appeared in the text. These signs were probably emojis or other figures that, after parsing, turned into these numbers within the text.
Dividing our data into two groups. Specifically, we divide our jokes into those that were “discussive” and more popular and those less discussive, which we name “not discussive”. As stated in our hypotheses, we suggest that these two groups will have different types of most common words: some words will lead to discussion, as people want to share their opinion about them, while others will not be so interesting, so people will leave fewer reactions to them (more about our hypotheses can be seen in the presentation of our final work). We name the texts that have more than 100 comments discussive, because the mean is 72 comments per post and the histogram of comments is strongly right-skewed, which means that the outliers (the discussive texts) lie in the segment of >= 100 comments per post. You can also notice that only a few posts in our histogram have more than 500 comments; these are isolated cases, and it would be incorrect to compare these two groups, as we wanted to do before.
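The chunk that drew this histogram is not shown above; a minimal sketch of how it could look, with a simulated right-skewed count variable standing in for the real comments column:

```r
library(ggplot2)

# Simulated stand-in for the real `comments` column: right-skewed counts
set.seed(1)
toy <- data.frame(comments = rnbinom(1000, size = 0.5, mu = 72))

# Histogram of comments per post; the dashed line marks the >= 100 cut-off
p <- ggplot(toy, aes(comments)) +
  geom_histogram(binwidth = 25) +
  geom_vline(xintercept = 100, linetype = "dashed") +
  labs(x = "Comments per post", y = "Number of posts",
       title = "Distribution of comments per post")
```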
We filter out the texts that have fewer than 15 words in them, as they are not very meaningful for our analysis and add noise to the further research of the texts.
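The chunk that creates the Sentiment label (used later in the filter() calls) and applies this word-count filter does not appear above; a minimal sketch of both steps on toy data, assuming the comments column and the >= 100 comments / 15-word thresholds described in the text:

```r
library(dplyr)
library(stringr)

# Toy stand-in for the real data: one short, popular post and one long, quiet one
data_toy <- tibble(
  text = c("a short joke", paste(rep("word", 20), collapse = " ")),
  comments = c(150, 3)
)

data_toy <- data_toy %>%
  # posts with >= 100 comments are labeled "discussive", the rest "not discussive"
  mutate(Sentiment = if_else(comments >= 100, "discussive", "not discussive")) %>%
  # drop texts shorter than 15 words, which mostly add noise
  filter(str_count(text, "\\S+") >= 15)
```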
data <- data %>% mutate(text = stri_replace_all_regex(text,
pattern=c("\n","\030", "\031", "\032", "\033", "\034", "\035", "\036", "\037", "\038", "\039", "\f", "&", "#x200B", "\"", "\"v\"", "\"z\"" , "\"w\"", "\"ph\"", "\"o\"", "\"e\""),
replacement="",
vectorize_all=FALSE))
# deleting all the signs and strange groups of numbers that appear in further analysis
Main Part
In the Main Part we start our analysis; as the methods that fit our hypotheses we have chosen Network Analysis, Sentiment Analysis, and Lexical Statistics.
Frequency list and list of top words
At this stage, we want to do a primary analysis of the word groups and see which words occur frequently in both groups, make a list of the top words, and visualize this with graphs. To begin with, we divided the data into 2 separate data frames, where one contains the texts of only the “discussive” group and the other only the “not discussive” group. After partitioning, we made tokens from each data frame for both groups.
datadiss <- filter(data, Sentiment == "discussive")
datanotdiss <- filter(data, Sentiment == "not discussive")
data_words1 <- datadiss %>% dplyr::select(Sentiment, text) %>% tidytext::unnest_tokens(word, text)
head(data_words1)
## # A tibble: 6 × 2
## Sentiment word
## <chr> <chr>
## 1 discussive a
## 2 discussive redneck
## 3 discussive is
## 4 discussive pulled
## 5 discussive over
## 6 discussive by
data_words2 <- datanotdiss %>% dplyr::select(Sentiment, text) %>% tidytext::unnest_tokens(word, text)
head(data_words2)
## # A tibble: 6 × 2
## Sentiment word
## <chr> <chr>
## 1 not discussive what
## 2 not discussive four
## 3 not discussive elements
## 4 not discussive are
## 5 not discussive you
## 6 not discussive not
At this point we removed the stop words from both groups and left only the words we needed. Further on in the process of analysis we noticed that among the remaining data there were many verbs and numbers in the list of top words, so we decided to clean up the data and remove these words as well.
Here you can see how custom stop words are created and added to stop_words.
stopwords("en")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
#Here you can see how custom stop words are created and added to stop_words.
custom_stop_words <- tribble(
#Matching column names like in stop_words
~word, ~lexicon,
# Add frequent but uninformative words from the jokes as custom stop words
"says", "CUSTOM",
"told", "CUSTOM",
"guy", "CUSTOM",
"day", "CUSTOM",
"replies", "CUSTOM",
"replied", "CUSTOM",
"first", "CUSTOM",
"walks", "CUSTOM",
"hey", "CUSTOM",
"kid", "CUSTOM",
"said", "CUSTOM",
"two", "CUSTOM",
"tells", "CUSTOM",
"takes", "CUSTOM",
"x200b", "CUSTOM",
"wife", "CUSTOM",
"doctor", "CUSTOM",
"woman", "CUSTOM",
"car", "CUSTOM",
"time", "CUSTOM",
"priest", "CUSTOM",
"f150", "CUSTOM"
)
# Bind the custom stop words to stop_words
stop_words2 <- stop_words %>%
bind_rows(custom_stop_words)
head(stop_words2)
## # A tibble: 6 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
enstopwords <- data.frame(word=stopwords("en"), stringsAsFactors=FALSE)
data.nonstopd <- data_words1 %>%
anti_join(stop_words2)
data.nonstopnd <- data_words2 %>%
anti_join(stop_words2)
data.nonstopd <- data.nonstopd %>%
filter(!str_detect(word, "[0-9]+"))
data.nonstopnd <- data.nonstopnd %>%
filter(!str_detect(word, "[0-9]+"))
Discussive group
Corpus statistics
Then we decided to create a frequency list, calculated the total corpus size and vocabulary size, then selected the 30 most frequent words and made a plot of their distribution. First we did this for the discussive group, and then for the non-discussive group.
For the discussive group we got a total corpus size of 3916 and a vocabulary size of 2128.
freqlistdiss <- data.nonstopd %>% count(word, sort=TRUE)
freqlistdiss
## # A tibble: 2,128 × 2
## word n
## <chr> <int>
## 1 father 22
## 2 peter 21
## 3 monk 19
## 4 home 18
## 5 call 17
## 6 hell 17
## 7 rolls 17
## 8 door 16
## 9 people 16
## 10 dog 15
## # … with 2,118 more rows
sum(freqlistdiss$n)
## [1] 3916
nrow(freqlistdiss)
## [1] 2128
freq_top30d <- freqlistdiss %>% top_n(30)
head(freq_top30d)
## # A tibble: 6 × 2
## word n
## <chr> <int>
## 1 father 22
## 2 peter 21
## 3 monk 19
## 4 home 18
## 5 call 17
## 6 hell 17
We also created a plot of the frequency distribution of the use of various words in the discussive group. This graph shows that among the top 30 frequently used words there are both ordinary words (such as guy, rolls, kia) and those that can prompt a person to leave a comment, that is, quite debatable ones (lover, chinese, jewish, chuck).
ggplot(freq_top30d, aes(reorder(word, -n), n)) +
geom_point(stat = "identity") +
geom_line(group = 1) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.7)) +
xlab("Words") +
ylab("Frequency") +
ggtitle("Distribution of the most frequent words in discussive Sentiment")
You can see that the frequencies in the list of discussive words are much lower. As discussed earlier, this is because, in general, the total size of the texts in the discussive group is much smaller. Here is an interesting find: the word father comes first in the list of discussive words, while mother comes first in the list of not-discussive words. Most likely, this suggests that for people in our society, the father is something more discussed and everyone has a different opinion about him. This may be because a large percentage of people in the world grow up either without a father or with a stepfather, so the picture of the father is different for everyone and invites discussion. In the case of the mother, it seems that this is something more stable and understandable, and people do not discuss it as actively. Here we already see the appearance of frequent words that really are subject to discussion, for example hell, jesus, monk. Another interesting observation is that two male names can be found in the list of common words for the discussive group: Peter and Brian. Perhaps these are the heroes of some films or TV shows, which is why they ended up in this top list.
A wordcloud helps to make our assumptions more illustrative.
library(wordcloud)
data.nonstopd %>%
dplyr::count(word) %>%
with(wordcloud(word, n, min.freq = 6,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2")))

data_freqdiss <- freqlistdiss %>% mutate(rank = row_number())
data_freqdiss
## # A tibble: 2,128 × 3
## word n rank
## <chr> <int> <int>
## 1 father 22 1
## 2 peter 21 2
## 3 monk 19 3
## 4 home 18 4
## 5 call 17 5
## 6 hell 17 6
## 7 rolls 17 7
## 8 door 16 8
## 9 people 16 9
## 10 dog 15 10
## # … with 2,118 more rows
Not Discussive group
Corpus Statistics
We did the same with the not discussive group. It turned out that for the not discussive group, the total corpus size = 18892 and the vocabulary size = 7109. We see that the data sizes of the discussive and not discussive groups differ markedly; we need to take this into account in further research.
freqlistnotdiss <- data.nonstopnd %>% count(word, sort=TRUE)
freqlistnotdiss
## # A tibble: 7,109 × 2
## word n
## <chr> <int>
## 1 mother 57
## 2 sex 50
## 3 father 48
## 4 husband 46
## 5 people 46
## 6 boy 45
## 7 call 45
## 8 door 45
## 9 home 44
## 10 dad 40
## # … with 7,099 more rows
sum(freqlistnotdiss$n)
## [1] 18892
nrow(freqlistnotdiss)
## [1] 7109
freq_top30nd <- freqlistnotdiss %>% top_n(30)
head(freq_top30nd)
## # A tibble: 6 × 2
## word n
## <chr> <int>
## 1 mother 57
## 2 sex 50
## 3 father 48
## 4 husband 46
## 5 people 46
## 6 boy 45
We also created a plot of the frequency of use of various words in the non-discussive group. This graph shows that among the top 30 frequently used words there are both common words (such as day, wife, guy, car, father, mother, etc.) and interesting words (such as sex, money, time).
ggplot(freq_top30nd, aes(reorder(word, -n), n)) +
geom_point(stat = "identity") +
geom_line(group = 1) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.7)) +
xlab("Words") +
ylab("Frequency") +
ggtitle("Distribution of the most frequent words in non-discussive Sentiment")

We can see that in this case the most frequent words are ones somehow related to daily activities: mother, father, sex, husband, call, etc. This supports our hypothesis that the posts we defined as not discussive contain words that are neutral for a person and that everyone encounters in their routine. Most likely, these are jokes about comic situations in everyday life, but we can only trace this through topics, so this graph gave us a partial indication that our hypothesis is plausible.
This is the wordcloud for the not discussive group. A wordcloud helps us visualize our data and see, in a simpler image, the frequent words in our list, as well as those that are frequent but not frequent enough to make it onto the graph. To do this, we identified different frequency groups that describe non-discussive posts and colored them with a specific color.
library(wordcloud)
data.nonstopnd %>%
dplyr::count(word) %>%
with(wordcloud(word, n, min.freq = 6,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2")))

data_freqnotdiss <- freqlistnotdiss %>% mutate(rank = row_number())
data_freqnotdiss
## # A tibble: 7,109 × 3
## word n rank
## <chr> <int> <int>
## 1 mother 57 1
## 2 sex 50 2
## 3 father 48 3
## 4 husband 46 4
## 5 people 46 5
## 6 boy 45 6
## 7 call 45 7
## 8 door 45 8
## 9 home 44 9
## 10 dad 40 10
## # … with 7,099 more rows
Conclusion: Given the wordclouds and the plots that show the most popular words in each group, namely discussive and non-discussive, we partially reject our first hypothesis, “H1: We assume that the more discussed jokes will have more sharply political and other controversial words in their corpus than the less discussed posts”, because we did not observe any serious difference between the two groups. As for the second hypothesis, “We assume that the most neutral or mundane words will be more common in non-discussive posts than discussion posts”, we partially confirm it, because, as the plots and wordclouds show, neutral words are common in both groups.
Sentiment Analysis
At this point we want to test our 3rd hypothesis and find out the emotional percentage of words in both groups. For this we decided to use sentiment analysis of the texts. We took our clean data (without stop words), combined it into one table, and joined a lexicon that determines what emotion each word carries.
dissandnondiss <- rbind(data.nonstopd,data.nonstopnd) %>%
dplyr::group_by(Sentiment) %>%
dplyr::mutate(words = n()) %>%
dplyr::left_join(tidytext::get_sentiments("nrc")) %>%
dplyr::mutate(word = factor(word),
sentiment = factor(sentiment))
head(dissandnondiss)
## # A tibble: 6 × 4
## # Groups: Sentiment [1]
## Sentiment word words sentiment
## <chr> <fct> <int> <fct>
## 1 discussive redneck 3916 <NA>
## 2 discussive pulled 3916 <NA>
## 3 discussive policeman 3916 fear
## 4 discussive policeman 3916 positive
## 5 discussive policeman 3916 trust
## 6 discussive policeman 3916 fear
Next, we added a column that counts the percentage of how many times a word occurs in each group.
percentall <- dissandnondiss %>%
dplyr::group_by(Sentiment) %>%
dplyr::group_by(word, sentiment,Sentiment) %>%
dplyr::summarise(sentiment = unique(sentiment),
sentiment_freq = n(),
words = unique(words)) %>%
dplyr::filter(is.na(sentiment) == F) %>%
dplyr::mutate(percentage = round(sentiment_freq/words*100, 1))
head(percentall)
## # A tibble: 6 × 6
## # Groups: word, sentiment [5]
## word sentiment Sentiment sentiment_freq words percentage
## <fct> <fct> <chr> <int> <int> <dbl>
## 1 absence fear discussive 1 3916 0
## 2 absence negative discussive 1 3916 0
## 3 absence sadness discussive 1 3916 0
## 4 absolute positive discussive 1 3916 0
## 5 absolute positive not discussive 1 18892 0
## 6 accidentally surprise discussive 1 3916 0
And here we have the graph of the emotions, which shows as a percentage how often each emotion occurs in each group. We can already see in the first graph that the discussive group (orange) outweighs the non-discussive group (grey) in the negative emotions (anger, disgust, sadness, and fear), but on the other hand, for positive emotions such as trust, joy, and surprise the results are debatable, so we decided to look further at where negative and positive emotions prevail.
In the second graph we see the same thing: the discussive group predominates more strongly in negative and aggressive emotions than the non-discussive group, but on the other hand the discussive group also predominates in positive emotions like trust and anticipation.
percentall %>%
dplyr::filter(sentiment != "positive",
sentiment != "negative") %>%
ggplot(aes(sentiment, percentage, fill = Sentiment)) +
geom_bar(stat="identity",
position=position_dodge()) +
scale_fill_manual(name = "", values=c("orange", "grey30")) +
theme_bw() +
theme(legend.position = "top")

percentall %>%
dplyr::filter(sentiment != "positive",
sentiment != "negative") %>%
dplyr::mutate(sentiment = factor(sentiment,
levels = c("anger", "fear", "disgust", "sadness",
"surprise", "anticipation", "trust", "joy"))) %>%
ggplot(aes(Sentiment, percentage, fill = sentiment)) +
geom_bar(stat = "identity", position=position_dodge()) +
scale_fill_brewer(palette = "RdBu") +
theme_bw() +
theme(legend.position = "right") +
coord_flip()
Next, we decided to look at which words in each group in each emotion category contributed to the emotion score. In other words, we will explore which words are most important for emotion scores in each group. For ease of interpretation, we will remove several major emotion categories as well as polarity.
posts_impw <- dissandnondiss %>%
dplyr::filter(!is.na(sentiment),
sentiment != "anticipation",
sentiment != "surprise",
sentiment != "disgust",
sentiment != "negative",
sentiment != "sadness",
sentiment != "positive") %>%
dplyr::mutate(sentiment = factor(sentiment, levels = c("anger", "fear", "trust", "joy"))) %>%
dplyr::group_by(Sentiment) %>%
dplyr::count(word, sentiment, sort = TRUE) %>%
dplyr::group_by(Sentiment, sentiment) %>%
dplyr::top_n(3) %>%
dplyr::mutate(score = n/sum(n))
posts_impw %>%
dplyr::group_by(Sentiment) %>%
slice_max(score, n = 20) %>%
dplyr::arrange(desc(score)) %>%
dplyr::ungroup() %>%
ggplot(aes(x = reorder(word, score), y = score, fill = word)) +
facet_wrap(Sentiment~sentiment, ncol = 4, scales = "free_y") +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(x = "Words")
Thus we see that in the discussive group, in the categories anger and fear, words like hell, money, lion, and bad contribute most, and some of them repeat across categories. In the categories trust and joy, more influence came from the words father, mother, love, and monk. In the non-discussive group, in the category anger, the most influential words are bear, money, and also bad; in the category fear, the greatest contribution was made by bear, god, bad, and so on.
In general, it is noticeable that in both groups words like god, mother, doctor, love, and bad appear, and we decided to analyze the two groups further.
Polarity exploration
Next, we wanted to take a closer look at the polarity of each group, for which we need to prepare the data and show the polarity of the words. First we will do an overview of the discussive group, then of the non-discussive group.
Discussive group
To begin with, we counted the polarity in the discussive group, that is, the mood of the group. As you can see, we have also calculated other statistical indicators, namely the standard deviation, the standardized mean, and the average polarity, which will be useful for visualizing the plot below.
datadiss %$% polarity(text)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 153 11410 0.001 0.303 0.002
dataDiss2 <- datadiss %$% polarity(text, Sentiment)
We needed all of this in order to build the polarity plot for the discussive group.
plot(dataDiss2)
Then, for clarity, we decided to build a graph that distributes the most popular words over the sentiment categories, and we got the graph below. It clearly shows which words are most often used for each sentiment. For example, in the disgust group the most popular word is hell.
nrc_lex <- get_sentiments("nrc")
nrc_lex %>% count(sentiment)
## # A tibble: 10 × 2
## sentiment n
## <chr> <int>
## 1 anger 1245
## 2 anticipation 837
## 3 disgust 1056
## 4 fear 1474
## 5 joy 687
## 6 negative 3316
## 7 positive 2308
## 8 sadness 1187
## 9 surprise 532
## 10 trust 1230
diss_tidy <- data.nonstopd %>%
# Inner join to nrc lexicon
inner_join(nrc_lex, by = c("word" = "word")) %>%
# Drop positive or negative
filter(!grepl("positive|negative", sentiment)) %>%
# Count by sentiment and term
count(sentiment, word) %>%
# Spread sentiment, using n for values
spread(sentiment, n, fill = 0) %>%
# Convert to data.frame, making term the row names
data.frame(row.names = "word")
comparison.cloud(diss_tidy, max.words = 50, title.size = 1.5)

Non-discussive group
We again counted the polarity, now in the non-discussive group, that is, the mood of the group. As you can see, we have also calculated other statistical indicators, namely the standard deviation, the standardized mean, and the average polarity, which will be useful for visualizing the plot below.
datanotdiss %$% polarity(text)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 847 56228 -0.023 0.256 -0.088
datanotDiss3 <- datanotdiss %$% polarity(text, Sentiment)
This is the graph we have for the non-discussive group; we can see that it contains more sentences compared to the graph above, and they are more crowded.
plot(datanotDiss3)
And for the non-discussive group, we have a wordcloud where we also clearly see the different words from each category. For example, the most popular word associated with anger is bear.
nrc_lex <- get_sentiments("nrc")
nrc_lex %>% count(sentiment)
## # A tibble: 10 × 2
## sentiment n
## <chr> <int>
## 1 anger 1245
## 2 anticipation 837
## 3 disgust 1056
## 4 fear 1474
## 5 joy 687
## 6 negative 3316
## 7 positive 2308
## 8 sadness 1187
## 9 surprise 532
## 10 trust 1230
nondiss_tidy <- data.nonstopnd %>%
# Inner join to nrc lexicon
inner_join(nrc_lex, by = c("word" = "word")) %>%
# Drop positive or negative
filter(!grepl("positive|negative", sentiment)) %>%
# Count by sentiment and term
count(sentiment, word) %>%
# Spread sentiment, using n for values
spread(sentiment, n, fill = 0) %>%
# Convert to data.frame, making term the row names
data.frame(row.names = "word")
# Examine
head(nondiss_tidy)
## anger anticipation disgust fear joy sadness surprise trust
## abandon 0 0 0 1 0 1 0 0
## abbot 0 0 0 0 0 0 0 2
## abuse 1 0 1 1 0 1 0 0
## accident 0 0 0 4 0 4 4 0
## accidentally 0 0 0 0 0 0 8 0
## accomplish 0 0 0 0 1 0 0 0
comparison.cloud(nondiss_tidy, max.words = 50, title.size = 1.5)

Conclusion: Considering all our analysis above, we can confirm our hypothesis about emotions, because we can clearly see that in the discussive group the polarity is higher and there are more words that are somehow related to aggression and negative emotions. If we recall our hypothesis, “Since more negative emotions will encourage people to write comments, there will be more words associated with negative emotions in the discussion group than in the non-discussive group”, and recall the resulting plots, it turns out that the discussive group had more negatively associated words than the non-discussive group.
Network
Relationships between words: N-grams
Relationships between words: N-grams within discussive posts
First we create bigrams
bigrams <- datadiss %>%
unnest_tokens(word, text, token = "ngrams", n = 2) %>%
separate(word, c("word1", "word2"), sep = " ")
head(bigrams)
## # A tibble: 6 × 9
## date_utc title text …¹ comme…² doc_id sum Senti…³ word1 word2
## <date> <chr> <chr> <dbl> <int> <int> <chr> <chr> <chr>
## 1 2022-11-20 A redneck is pull… "Polic… 144 604 84 discus… a redn…
## 2 2022-11-20 A redneck is pull… "Polic… 144 604 84 discus… redn… is
## 3 2022-11-20 A redneck is pull… "Polic… 144 604 84 discus… is pull…
## 4 2022-11-20 A redneck is pull… "Polic… 144 604 84 discus… pull… over
## 5 2022-11-20 A redneck is pull… "Polic… 144 604 84 discus… over by
## 6 2022-11-20 A redneck is pull… "Polic… 144 604 84 discus… by a
## # … with abbreviated variable names ¹`text (without title)`, ²comments,
## # ³Sentiment
Making tf-idf for bigrams. The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents.
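As a quick reminder of what bind_tf_idf computes: tf is the term's share of its document, idf = ln(number of documents / number of documents containing the term), and tf-idf is their product. A small hand-check on toy data (the two-word "documents" here are invented for illustration):

```r
library(dplyr)
library(tidytext)

# Two toy documents: "apple" appears in both, "pie" only in document 2
toy <- tibble(doc  = c(1, 1, 2, 2),
              word = c("apple", "apple", "apple", "pie")) %>%
  count(doc, word) %>%
  bind_tf_idf(word, doc, n)

# "apple" occurs in every document, so idf = ln(2/2) = 0 and its tf-idf is 0;
# "pie" occurs only in doc 2: tf = 1/2, idf = ln(2/1), tf-idf = 0.5 * ln(2)
```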
bigrams <- datadiss %>%
unnest_tokens(bigram, text, token = "ngrams", n=2)
bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 8,263 × 2
## bigram n
## <chr> <int>
## 1 of the 51
## 2 in the 45
## 3 to the 45
## 4 and the 27
## 5 do you 25
## 6 on the 24
## 7 and says 21
## 8 in a 21
## 9 at the 19
## 10 the man 17
## # … with 8,253 more rows
bigram_tf_idf <- bigrams %>%
count(doc_id, bigram) %>%
bind_tf_idf(bigram, doc_id, n) %>%
arrange(desc(tf_idf))
head(bigram_tf_idf)
## # A tibble: 6 × 6
## doc_id bigram n tf idf tf_idf
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 892 3 words:lazy 1 0.25 5.03 1.26
## 2 892 describe yourself 1 0.25 5.03 1.26
## 3 892 in 3 1 0.25 5.03 1.26
## 4 892 yourself in 1 0.25 5.03 1.26
## 5 805 with t 2 0.167 5.03 0.838
## 6 945 my coffee 2 0.167 4.34 0.723
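To make the tf-idf statistic concrete, here is a hand computation on a toy three-document corpus (the words are invented for illustration); this mirrors what bind_tf_idf() computes, using the natural logarithm as tidytext does:

```r
# Toy corpus: three tiny "documents" of tokenized words
docs <- list(
  d1 = c("st", "peter", "opens", "the", "gate"),
  d2 = c("the", "man", "opens", "the", "door"),
  d3 = c("hot", "dog", "man")
)

term <- "peter"
# term frequency: share of the document's tokens that are this term
tf <- sum(docs$d1 == term) / length(docs$d1)                   # 1/5 = 0.2
# document frequency: in how many documents the term appears
df <- sum(vapply(docs, function(d) term %in% d, logical(1)))   # 1
# inverse document frequency: natural log of (n docs / df)
idf <- log(length(docs) / df)                                  # ln(3) ~ 1.099
tf_idf <- tf * idf
round(tf_idf, 3)                                               # ~ 0.22
```

A term that appears in every document gets idf = log(3/3) = 0, so its tf-idf is zero no matter how frequent it is, which is why filler bigrams like "of the" fall to the bottom of the ranking.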
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
Visualising a network of bigrams in discussive group
R is able to visualise the relationships among words simultaneously. We can arrange the words into a network with a combination of nodes. The graphs have 3 variables:
From: the node an edge comes from. To: the node an edge goes towards. Weight: a numeric value associated with each edge.
On the graph we can see the most common word combinations in the discussive texts. In general, each of the collocations makes sense: for example, St Peter, pilot in airbus, rolls royce, hot dog. You can also see that the words Chinese and Jewish both combine with guy, meaning that both "Chinese guy" and "Jewish guy" occur.
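To make the From/To/Weight structure concrete, here is a minimal sketch with an invented edge list (the words are illustrative, not taken from our corpus), using only base R:

```r
# Toy edge list in the from/to/weight shape described above
edges <- data.frame(
  from   = c("st",    "hot", "rolls", "chinese", "jewish"),
  to     = c("peter", "dog", "royce", "guy",     "guy"),
  weight = c(5, 4, 3, 3, 2)
)

# Node degree: how many edges touch each word
deg <- table(c(edges$from, edges$to))
deg[order(-deg)][1]   # "guy" has the highest degree (2)
```

This is exactly the data frame shape that graph_from_data_frame() expects: the first two columns define the edges, any further columns (here weight) become edge attributes.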
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_graph <- bigram_counts %>%
filter(n > 2) %>%
graph_from_data_frame()
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label=name), vjust = 1, hjust = 1)
This is a more detailed chart with directionality. Here you can see from which word to which word bigrams go: the first word (from which the arrow comes) is the word the bigram begins with, and the second word (which the arrow points to) is where the bigram ends.
Example: front -> door, chinese -> guy <- jewish
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()

Pairwise Correlation for discussive group
The pairwise_cor() function from widyr computes the pairwise correlation between words: how often two words appear together in the same document relative to how often they appear apart. The pairwise comparison method (sometimes called the 'paired comparison method') is a process for ranking or choosing from a group of alternatives by comparing them against each other in pairs, i.e. two alternatives at a time. Most of all we liked the combination of politics and Satan.
datadiss <- datadiss %>% tidytext::unnest_tokens(word, text)
wosw_tidy <- datadiss %>%
anti_join(get_stopwords()) %>%
filter(is.na(as.numeric(word)))
wosw_tidy %>%
select(doc_id, word) %>%
head(10)
## # A tibble: 10 × 2
## doc_id word
## <int> <chr>
## 1 604 redneck
## 2 604 pulled
## 3 604 policeman
## 4 604 policeman
## 5 604 got
## 6 604 id
## 7 604 redneck
## 8 605 removed
## 9 605 reddit
## 10 605 removed
word_cors <- wosw_tidy %>%
group_by(word) %>%
filter(n() >= 5) %>%
pairwise_cor(word, doc_id, sort = TRUE)
word_cors
## # A tibble: 63,756 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 sisters sign 1
## 2 mercy sign 1
## 3 sign sisters 1
## 4 mercy sisters 1
## 5 sign mercy 1
## 6 sisters mercy 1
## 7 millionaire crocodile 1
## 8 brian crocodile 1
## 9 crocodile millionaire 1
## 10 brian millionaire 1
## # … with 63,746 more rows
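The correlation that pairwise_cor() reports is the phi coefficient, which is simply the Pearson correlation of the two words' 0/1 presence vectors across documents. A minimal base-R illustration with invented occurrence patterns:

```r
# 0/1 indicators: does the word appear in documents 1..5?
docs_with_sisters <- c(1, 0, 1, 0, 1)  # appears in docs 1, 3, 5
docs_with_mercy   <- c(1, 0, 1, 0, 1)  # identical pattern
docs_with_dog     <- c(0, 1, 0, 1, 0)  # exactly opposite pattern

cor(docs_with_sisters, docs_with_mercy)  # 1: always co-occur
cor(docs_with_sisters, docs_with_dog)    # -1: never co-occur
```

This explains the run of exact 1s at the top of word_cors: pairs like "sisters"/"mercy" appear in precisely the same (few) documents, which is also why the n() >= 5 filter matters, as rare words trivially produce perfect correlations.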
The output above is useful for further exploration. Let's visualise the pairs with the strongest correlations as a network.
word_cors %>%
filter(correlation > .75) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
Topic modelling in discussive group with LDA
Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words.
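The "mixture of topics / mixture of words" idea can be sketched as a toy generative process in base R (the topics, words, and probabilities below are invented for illustration, not taken from our fitted model):

```r
set.seed(1)
# Each topic is a probability distribution over words
topics <- list(
  religion = c(hell = 0.5, father = 0.3, home = 0.2),
  family   = c(dog  = 0.4, house  = 0.4, boy  = 0.2)
)
# Each document is a probability distribution over topics
doc_mix <- c(religion = 0.7, family = 0.3)

draw_word <- function() {
  t <- sample(names(doc_mix), 1, prob = doc_mix)      # pick a topic
  sample(names(topics[[t]]), 1, prob = topics[[t]])   # pick a word from it
}
replicate(10, draw_word())   # ten words generated from the mixture
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the per-topic word probabilities (beta) and the per-document topic mixtures.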
dtm <- DocumentTermMatrix(data.nonstopd$word)
dtm <- na.omit(dtm)
# drop empty documents, which LDA cannot handle
raw.sum <- apply(dtm, 1, FUN = sum)
dtm <- dtm[raw.sum != 0, ]
# fit an LDA model with 3 topics (fixed seed for reproducibility)
ap_lda <- LDA(dtm, k = 3, control = list(seed = 11091987))
ap_lda
## A LDA_VEM topic model with 3 topics.
Now we’ve fitted a LDA model. We have to now explore and interpret the model.
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 6,315 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 redneck 0.000523
## 2 2 redneck 0.000386
## 3 3 redneck 0.000644
## 4 1 pulled 0.00205
## 5 2 pulled 0.000221
## 6 3 pulled 0.000831
## 7 1 policeman 0.0000171
## 8 2 policeman 0.000741
## 9 3 policeman 0.000794
## 10 1 removed 0.00153
## # … with 6,305 more rows
ap_top_terms <- ap_topics %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
ap_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()

This visualization lets us understand the 3 topics that were extracted from the corpus. The most common words in topic 1 include "hell", "father", "home", which suggests it may represent religion jokes. Those most common in topic 2 include "father", "door", "St Peter", suggesting that this topic also draws on religious settings. Those most common in topic 3 include "monk", "dog", and "pilot", suggesting that this topic represents family life jokes. One important observation is that some words, such as "monk" and "father", are common within two topics.
beta_spread <- ap_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(topic1 > .001 | topic2 > .001 | topic3 > 0.001) %>%
mutate(log_ratio = log2(topic2 / topic1))
beta_spread
## # A tibble: 390 × 5
## term topic1 topic2 topic3 log_ratio
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 absolutely 0.000497 0.000550 0.00128 0.146
## 2 ago 0.000323 0.00161 0.00117 2.32
## 3 aiden 0.00106 0.000434 0.000836 -1.28
## 4 air 0.000679 0.000595 0.00105 -0.191
## 5 airbus 0.00341 0.00238 0.00120 -0.518
## 6 alive 0.00110 0.000627 0.000597 -0.818
## 7 american 0.00108 0.000980 0.000269 -0.140
## 8 answer 0.00112 0.000880 0.000327 -0.351
## 9 answered 0.00124 0.00211 0.00209 0.773
## 10 answers 0.00168 0.000636 0.0000162 -1.40
## # … with 380 more rows
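The log_ratio column above is log2(topic2 / topic1); a quick illustration with invented beta values shows why this measure is symmetric around zero:

```r
# Invented per-topic word probabilities, purely for illustration
beta_topic1 <- 0.001
beta_topic2 <- 0.004

log2(beta_topic2 / beta_topic1)   # 2: four times more likely in topic 2
log2(beta_topic1 / beta_topic2)   # -2: reversing the comparison flips the sign
```

So a log_ratio of +1 means the term is twice as likely under topic 2, -1 means twice as likely under topic 1, and 0 means the term does not discriminate between the two topics.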
Relationships between words N-Gram
Relationships between words N-Gram in NOT discussive posts
First we create bigrams
bigrams <- datanotdiss %>%
unnest_tokens(bigram, text, token = "ngrams", n=2)
bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 33,194 × 2
## bigram n
## <chr> <int>
## 1 to the 209
## 2 in the 199
## 3 of the 185
## 4 and the 129
## 5 on the 121
## 6 at the 113
## 7 do you 106
## 8 the man 102
## 9 and says 86
## 10 for a 82
## # … with 33,184 more rows
bigram_tf_idf <- bigrams %>%
count(doc_id, bigram) %>%
bind_tf_idf(bigram, doc_id, n) %>%
arrange(desc(tf_idf))
head(bigram_tf_idf)
## # A tibble: 6 × 6
## doc_id bigram n tf idf tf_idf
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 653 didn't jog 12 0.279 6.74 1.88
## 2 496 chuck norris 2 0.25 6.74 1.69
## 3 589 can't buy 2 0.222 6.74 1.50
## 4 162 irony wrinkly 1 0.2 6.74 1.35
## 5 162 of irony 1 0.2 6.74 1.35
## 6 162 opposite of 1 0.2 6.74 1.35
Visualising a network of bigrams in NOT discussive group
R is able to visualise the relationships among words simultaneously. We can arrange the words into a network with a combination of nodes. The graphs have 3 variables:
From: the node an edge comes from. To: the node an edge goes towards. Weight: a numeric value associated with each edge.
On the graph we can see the most common word combinations in the NOT discussive texts. In general, each of the collocations makes sense: for example, tic tacs, shop owners. You can also see that the word lemon combines with both half and cookie, meaning that both "half lemon" and "lemon cookie" occur. The word stab appears as a single point with an edge to itself, because it most often occurs next to itself.
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_graph <- bigram_counts %>%
filter(n > 4) %>%
graph_from_data_frame()
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label=name), vjust = 1, hjust = 1)
This is a more detailed chart with directionality. Here you can see from which word to which word bigrams go: the first word (from which the arrow comes) is the word the bigram begins with, and the second word (which the arrow points to) is where the bigram ends.
Example: bad -> news half -> lemon -> cookie
If we compare the two groups we have identified, this also supports hypothesis #5: the corpora of the discussive group most often contain bigrams with religious overtones (St Peter, Virgin Mary, etc.), while the words in the NOT discussive group are more neutral.
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()

Pairwise Correlation for NOT discussive group
The pairwise_cor() function from widyr computes the pairwise correlation between words: how often two words appear together in the same document relative to how often they appear apart. The pairwise comparison method (sometimes called the 'paired comparison method') is a process for ranking or choosing from a group of alternatives by comparing them against each other in pairs, i.e. two alternatives at a time.
datanotdiss <- datanotdiss %>% tidytext::unnest_tokens(word, text)
wosw_tidy <- datanotdiss %>%
anti_join(get_stopwords()) %>%
filter(is.na(as.numeric(word)))
wosw_tidy %>%
select(doc_id, word) %>%
head(10)
## # A tibble: 10 × 2
## doc_id word
## <int> <chr>
## 1 1 four
## 2 1 elements
## 3 1 allowed
## 4 1 bring
## 5 1 job
## 6 1 nitrogen
## 7 1 sulfur
## 8 1 fluorine
## 9 1 tungsten
## 10 1 nsfw
word_cors <- wosw_tidy %>%
group_by(word) %>%
filter(n() >= 5) %>%
pairwise_cor(word, doc_id, sort = TRUE)
word_cors
## # A tibble: 1,438,800 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 birch beech 1
## 2 woodpecker beech 1
## 3 beech birch 1
## 4 woodpecker birch 1
## 5 beech woodpecker 1
## 6 birch woodpecker 1
## 7 hairy sophie 1
## 8 sophie hairy 1
## 9 thash shee 1
## 10 shee thash 1
## # … with 1,438,790 more rows
The output above is useful for further exploration. Let's visualise the pairs with the strongest correlations as a network.
word_cors %>%
filter(correlation > .95) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()

Topic modelling in NOT discussive group with LDA
dtm <- DocumentTermMatrix(data.nonstopnd$word)
dtm <- na.omit(dtm)
# drop empty documents, which LDA cannot handle
raw.sum <- apply(dtm, 1, FUN = sum)
dtm <- dtm[raw.sum != 0, ]
# fit an LDA model with 3 topics (same seed as before)
ap_lda <- LDA(dtm, k = 3, control = list(seed = 11091987))
ap_lda
## A LDA_VEM topic model with 3 topics.
Now we’ve fitted a LDA model. We have to now explore and interpret the model.
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 21,180 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 elements 0.0000850
## 2 2 elements 0.0000719
## 3 3 elements 0.00000318
## 4 1 allowed 0.000563
## 5 2 allowed 0.000522
## 6 3 allowed 0.000196
## 7 1 bring 0.000140
## 8 2 bring 0.00223
## 9 3 bring 0.000512
## 10 1 job 0.00178
## # … with 21,170 more rows
ap_top_terms <- ap_topics %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
ap_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()

This visualization lets us understand the 3 topics that were extracted from the corpus. The most common words in topic 1 include "home", "house", "dad", which suggests it may represent jokes about family. Those most common in topic 2 include "mother", "started", "people", suggesting that this topic represents jokes about mothers. Those most common in topic 3 include "boy", "hand", and "farmer", suggesting that this topic represents jokes about a boy's life.
Compared to the discussive group, you can see that the topics in the NOT discussive group are more neutral and mundane. There are no such charged topics as religion, for example. Interestingly, no topic explicitly related to politics emerged in the discussive group either, although words about politicians do appear in its bigram visualisations.
In this case, we confirm our 4th hypothesis: when we divide discussive and not discussive texts into topics, there are indeed more controversial topics and more words related to politics, religion, etc. in the discussive group than in the not discussive one, as described above.
beta_spread <- ap_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(topic1 > .001 | topic2 > .001 | topic3 > 0.001) %>%
mutate(log_ratio = log2(topic2 / topic1))
beta_spread
## # A tibble: 235 × 5
## term topic1 topic2 topic3 log_ratio
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 afraid 0.00111 0.0000706 0.000742 -3.97
## 2 american 0.000571 0.00103 0.00128 0.850
## 3 answers 0.00111 0.000167 0.00113 -2.73
## 4 apple 0.00141 0.000296 0.00117 -2.26
## 5 approached 0.0000784 0.000234 0.00113 1.58
## 6 baby 0.00114 0.0000838 0.000377 -3.77
## 7 bad 0.00276 0.00211 0.0000914 -0.391
## 8 ball 0.00259 0.00136 0.000859 -0.926
## 9 band 0.000105 0.00137 0.000282 3.71
## 10 bang 0.00106 0.00150 0.000485 0.503
## # … with 225 more rows
Conclusion:
In our work we used various methods of text analysis, through which we were able to explore the texts fully and confirm or disprove the hypotheses set out at the beginning of the work. Initially, we assumed that the discussion group would contain more controversial topics than the NOT discussion group. We confirmed this hypothesis using topic modelling, taking 3 topics from each of the two groups and analysing the words that occur in them. We also hypothesised that the discussion group would have more bigrams with religious or other context that not all people understand. We were able to confirm this hypothesis through network analysis.
Here again is the list of our hypotheses: H1: We assume that the more discussed jokes will have more sharply political and other controversial words in their corpus than the less discussed posts. This may be because people have radically different views on politics and other world issues, so they will want to express their opinions in the comments or debate with other users. - Rejected
H2: We assume that the most neutral or mundane words will be more common in non-discussive posts than in discussion posts. This may be because people are very similar in these areas and lifestyles, so they won't debate in the comments or leave many comments because of it. - Partially confirmed
H3: Since more negative emotions will encourage people to write comments, there will be more words associated with negative emotions in the discussion group than in the non-discussive group. - Confirmed
H4: When dividing discussive and not discussive texts into topics, there will be more controversial topics and words related to politics, religion, etc. in the discussive group than in the not discussive one. - Confirmed
H5: In the network analysis, it will be noticeable how words in the discussive group cluster around rather controversial topics, such as politics and religion, while in the non-discussive group there will be more daily life jokes, about family, friends, celebrations, etc. - Confirmed
We also thought about improvements to our work. It would be valuable to run a similar analysis of Reddit posts on a larger sample; recall that this sample was drawn from posts published in the last 3 months. Word polarity in particular could then be studied more closely: we liked how the polarity analysis performed on our data, and further study on a new sample could reveal a lot more.
Thank you all for your attention and Happy New Year
References
[1] Reddit: page "Jokes: Get Your Funny On!"
[2] Binns, Amy. 2012. 'Don't Feed the Trolls!' Journalism Practice 6(4):547-62. doi: 10.1080/17512786.2011.648988.